Abstract: As transistor sizes shrink and we approach the “end of Moore’s
law”, interconnects, both on-chip and off-chip, will represent the biggest
bottleneck for embedded systems designers. Several groups are researching
optical interconnects to cope with this trend. Optical interconnects enable
new system architectures. These new architectures in turn require new
methods for high-level application mapping and hardware/software co-design.
In this presentation, we discuss high-level scheduling and interconnect
topology synthesis techniques for embedded multiprocessors. We focus on
designs that are streamlined for one or more digital signal processing (DSP)
applications. That is, we seek to synthesize an application-specific
interconnect topology for a multiprocessor DSP design. We show that flexible
interconnect topologies that allow single-hop communication between
processors offer advantages for reduced power and latency.

We have previously shown that multiprocessor scheduling algorithms can
deadlock in the general case of a topology graph that is not strongly
connected, or if communication is limited to a single hop. We have also
demonstrated an efficient algorithm that can be used in conjunction with
existing scheduling algorithms to avoid this deadlock [1]. In this
presentation we discuss the advantages of performing application scheduling
and interconnect synthesis jointly, and present a probabilistic
scheduling/interconnect algorithm utilizing graph isomorphism to pare the
design space. We demonstrate the performance advantages that an
application-specific interconnect topology can produce for several DSP
benchmarks.

Index  Terms — interconnect  synthesis,  multiprocessor, 
scheduling. 

I.  Introduction 

Interconnect considerations are important for today’s embedded systems
designs. As transistor density increases, more functional units can be
placed on a single chip, and the number of possible interconnections (links)
between them increases. The longest wires on the chip are usually due to
these links. These wires contribute to delay and limit the maximum
achievable clock rate. Also, routing these interconnections is a significant
challenge for the electronic design automation tools.

Embedded systems typically run a limited and fixed set of applications. We
can use this application-specific information to optimize the
interconnection network. For our purposes, an optimal network is defined in
the context of a set of applications and constraints. The constraints may
include the latency, throughput, and power consumption for the given
applications, along with cost and area constraints of the overall system. A
key distinguishing feature of our algorithm is that we perform the
application scheduling and interconnect synthesis jointly.

A.  Optical  Interconnects 

In recent years, optics have played an increasing role in multiprocessor
systems. Commercial high-performance computers now use fiber ribbons to
connect multiple processing nodes. Other examples include storage area
networks using Fibre Channel, and optical clock distribution to reduce clock
skew across a chip. Programs such as the DARPA VLSI Photonics [2] program
are pushing to integrate photonics technology on a single chip. Intel is
currently backing an effort to bring “fiber-to-the-processor” [3]. The idea
is to break the processor-to-cache bottleneck by using an optical waveguide
integrated on the processor chip.

B.  Connection  Topologies 

Electrically connected multiprocessor systems generally have a regular
interconnection pattern, due to the physical constraints imposed by
two-dimensional circuit board layout. Some examples include ring, mesh, bus,
and hypercube interconnect topologies. Using these topologies, communication
between remote processors requires multiple hops, which increases both
latency and power, and increases contention throughout the network.

In contrast, optically connected multiprocessors, particularly those
utilizing free space optics and three dimensions, are free to utilize
arbitrarily irregular interconnection networks. Once the signal is in the
optical domain, there is very little attenuation, so the energy required to
transmit a unit of data is essentially independent of distance. The required
energy instead is a function of the number of electrical-to-optical
conversions that must be performed [4], which in turn is determined by the
number of hops. With single-hop schedules the overhead associated with
routing data through intermediate processors is eliminated. Furthermore, due
to the flexibility of the communication medium, it is generally possible to
avoid multi-hop communication operations by simply activating direct
communication channels between the source and destination processors.
Together, these properties make it desirable to limit the number of hops per
communication operation when exploring configurations (interconnection
patterns and task graph mappings) for an optically connected, embedded
multiprocessor.

Application   N   Δ(E) (%)   Δ(M) (%)
FFT1          7   16         8
Karp10        6   24         4
Irr           8   16         (2)
Qmf4          7   32         3
NN16-3-4      8   58         2
Sum1          6   1          4
Laplace       7   4          (3)
FFT2          7   12         2

TABLE I

Reduction in communication energy (Δ(E)) and makespan increase (Δ(M)) of
single-hop schedule over three-hop schedule.

In order to quantify this effect, we scheduled several DSP benchmark
applications using our modified scheduling technique, which takes the number
of hops as an input parameter. We scheduled the benchmarks with hop
constraints of one hop and three hops, and compared the communication energy
required. For our purposes, we assumed all communication tasks transferred
the same number of bits, so the energy cost of all IPC actors was equal.
Table I shows the reduction in the required communication energy for
single-hop schedules over three-hop schedules for the benchmark
applications. For these benchmarks, we found that any undesirable effect on
the makespan of the additional constraint for single-hop schedules was very
small, as can be seen in Table I. In two of the benchmarks (Irr and
Laplace), the makespan was in fact better (lower) when we limited the
scheduler to single hops.
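The energy comparison above can be illustrated with a small Python sketch. It assumes, as in the text, that every transfer moves the same number of bits and that each hop costs one electrical-to-optical conversion. The hop counts and the names `communication_energy` and `percent_reduction` are hypothetical and are not taken from the benchmarks.

```python
def communication_energy(hops_per_transfer, bits=1, energy_per_conversion=1.0):
    """Total energy when every hop costs one electrical-to-optical conversion."""
    return sum(h * bits * energy_per_conversion for h in hops_per_transfer)


def percent_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical hop counts for the same five IPC transfers under two schedules.
three_hop_schedule = [1, 2, 1, 3, 1]   # some transfers relayed through other processors
single_hop_schedule = [1, 1, 1, 1, 1]  # every transfer uses a direct link

e3 = communication_energy(three_hop_schedule)
e1 = communication_energy(single_hop_schedule)
print(percent_reduction(e3, e1))  # 37.5
```

Under these assumptions the reduction depends only on how many transfers the hop-limited scheduler would otherwise have relayed, which is why it varies so widely between benchmarks.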

C.  Interconnect  Synthesis  Algorithm 

We present a genetic algorithm for synthesizing efficient interconnection
networks for embedded multiprocessors. The algorithm works in conjunction
with a list scheduling algorithm to jointly optimize both the schedule and
the interconnect topology. The algorithm is able to account for different
distributions of local vs. global (long) interconnect routing tracks via a
processor fanout constraint. It uses graph isomorphism to significantly pare
the search space in order to search more efficiently.

Fig. 1. GA output versus generation.

We evaluated our interconnect synthesis algorithm on several DSP benchmark
application graphs. Figure 1 shows the convergence of the GA vs. generation
number for an FFT application with a maximum allowed fanout of 4. In this
plot the y-axis refers to the schedule makespan of the best interconnection
topology found for a given generation. We see that the GA was able to reduce
the makespan by almost a factor of two over the best topology in the initial
population.
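The shape of such a search can be sketched in Python under strong simplifying assumptions. This is not the algorithm evaluated above: the fitness here is a stand-in (total hop count over a hypothetical list of communicating processor pairs) rather than the makespan returned by a list scheduler, and the operators, names, and parameters are all illustrative.

```python
import random
from collections import deque


def total_hops(links, traffic, n):
    """Stand-in fitness: sum of BFS hop counts over communicating pairs.
    Unreachable pairs are penalized. (The source uses schedule makespan.)"""
    cost = 0
    for src, dst in traffic:
        dist, queue = {src: 0}, deque([src])
        while queue and dst not in dist:
            p = queue.popleft()
            for q in range(n):
                if (p, q) in links and q not in dist:
                    dist[q] = dist[p] + 1
                    queue.append(q)
        cost += dist.get(dst, 10 * n)
    return cost


def fanout_ok(links, n, max_fanout):
    """True if no processor has more than `max_fanout` outgoing links."""
    return all(sum(1 for (p, _) in links if p == s) <= max_fanout for s in range(n))


def random_topology(n, max_fanout, rng):
    links = set()
    for p in range(n):
        others = [q for q in range(n) if q != p]
        links.update((p, q) for q in rng.sample(others, rng.randint(0, max_fanout)))
    return frozenset(links)


def mutate(links, n, max_fanout, rng):
    p, q = rng.sample(range(n), 2)
    child = set(links) ^ {(p, q)}  # toggle one directed link
    return frozenset(child) if fanout_ok(child, n, max_fanout) else links


def crossover(a, b, n, rng):
    """Copy each processor's whole outgoing link set from one parent, so the
    child of two fanout-feasible parents is also fanout-feasible."""
    child = set()
    for p in range(n):
        parent = a if rng.random() < 0.5 else b
        child.update(link for link in parent if link[0] == p)
    return frozenset(child)


def evolve(n, traffic, max_fanout, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    pop = [random_topology(n, max_fanout, rng) for _ in range(pop_size)]
    history = []  # best fitness per generation
    for _ in range(generations):
        pop.sort(key=lambda t: total_hops(t, traffic, n))
        history.append(total_hops(pop[0], traffic, n))
        elite = pop[: pop_size // 2]  # elitism keeps the best topologies
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), n, rng),
                   n, max_fanout, rng)
            for _ in range(pop_size - len(elite))
        ]
        pop = elite + children
    return min(pop, key=lambda t: total_hops(t, traffic, n)), history


traffic = [(0, 3), (3, 5), (1, 4), (4, 0), (2, 5)]  # hypothetical processor pairs
best, history = evolve(n=6, traffic=traffic, max_fanout=2)
print(history[0], history[-1])
```

Because the elite half of each generation is carried forward unchanged, the best fitness recorded in `history` never gets worse from one generation to the next. Replacing `total_hops` with a scheduler call would turn this into a joint schedule/topology search of the kind described above.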

We calculated how the makespan improves as the maximum fanout constraint is
increased. This amounts to an area/performance tradeoff in the system. We
also compared the performance of systems with topologies available with
electrical interconnects vs. optical interconnects. These topics will be
described in the presentation.


Deadlock and Flexibility


•  Existing scheduling algorithms assume every pair of processors can communicate
•  Scheduling not well studied for irregular interconnection networks
•  Can deadlock for arbitrary topologies
•  Developed algorithms to adapt existing list scheduling algorithms to avoid deadlock
•  Some scheduling moves have greater flexibility


[Figure: application graph and topology graph]


Partial Schedule
A on proc. 2
B on proc. 1

Constraint Sets
F[A] = {2}
F[B] = {1}
F[F] = {1}
F[D] = {1,2}
F[E] = {1,2,3}
F[C] = {1,2,3}
Flexibility = 11/24


Partial Schedule
A on proc. 2
B on proc. 3

Constraint Sets
F[A] = {2}
F[B] = {3}
F[F] = {0,3}
F[D] = {1,2,3}
F[E] = {0,1,2,3}
F[C] = {0,1,2,3}
Flexibility = 15/24
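
The two flexibility values above are consistent with one simple reading: the total size of all constraint sets divided by (number of tasks × number of processors), with four processors numbered 0 to 3. That formula is inferred from the numbers shown, not stated in the source, so the Python sketch below is only a check of that reading. The function name `flexibility` is hypothetical.

```python
from fractions import Fraction


def flexibility(constraint_sets, num_processors):
    """Assumed metric: sum of constraint-set sizes over tasks x processors."""
    total = sum(len(allowed) for allowed in constraint_sets.values())
    return Fraction(total, len(constraint_sets) * num_processors)


# Constraint sets for the two partial schedules shown above.
b_on_proc_1 = {"A": {2}, "B": {1}, "F": {1},
               "D": {1, 2}, "E": {1, 2, 3}, "C": {1, 2, 3}}
b_on_proc_3 = {"A": {2}, "B": {3}, "F": {0, 3},
               "D": {1, 2, 3}, "E": {0, 1, 2, 3}, "C": {0, 1, 2, 3}}

print(flexibility(b_on_proc_1, 4))  # 11/24
print(flexibility(b_on_proc_3, 4))  # 5/8, i.e. 15/24 in lowest terms
```

Under this reading, placing B on processor 3 leaves more processor choices open for the unscheduled tasks, which is what makes it the more flexible move.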

[Figure: topology comparison]

Lower latency and communication energy for topology 1


Low Hop Communication Saves Energy

[Figure: karp10, 7 processors, random link removal; axis: number of graphs]


Link Synthesis Algorithm

•  Developed both deterministic and evolutionary (GA) algorithms
•  GA objective utilizes DLS scheduling modified with flexibility metric
•  Crossover operators allow fan-out constraints to be preserved
•  Use graph isomorphism to pare the design space


Paring the design space

[Figure: 5 nodes]

•  Consider only isomorphically unique graphs
•  Reduction by orders of magnitude


Link synthesis results

[Figure: neural net (8 input - 2 hidden - 4 output) on 10 processors]

GA (red) outperforms deterministic over a range of topologies
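
The paring of the design space by graph isomorphism can be illustrated on small cases with a brute-force Python sketch. It assumes undirected links and unlabeled (interchangeable) processors, which may not match the directed topology graphs used in the actual tool. The canonical form used here, the smallest relabelled edge list over all node permutations, is chosen for brevity and is only practical for a handful of nodes.

```python
from itertools import combinations, permutations


def canonical_form(n, edges):
    """Smallest relabelled edge tuple over all node permutations (brute force)."""
    best = None
    for perm in permutations(range(n)):
        relabelled = tuple(sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges))
        if best is None or relabelled < best:
            best = relabelled
    return best


def count_unique_topologies(n):
    """Return (number of labelled graphs, number of isomorphism classes)."""
    pairs = list(combinations(range(n), 2))
    seen = set()
    for mask in range(1 << len(pairs)):
        edges = [pairs[i] for i in range(len(pairs)) if mask >> i & 1]
        seen.add(canonical_form(n, edges))
    return 1 << len(pairs), len(seen)


print(count_unique_topologies(5))  # (1024, 34)
```

For 5 nodes the 1024 labelled undirected graphs collapse to 34 isomorphically unique ones, and the ratio grows quickly with the number of nodes.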
•  Develop software tools and algorithms to efficiently map
digital signal and image processing (“DSP”) applications
onto Systems on Chip.

-  Joint  scheduling/interconnect  synthesis  optimization 

-  Scheduling  for  low-hop  communication  on  arbitrary 
topologies 

-  Synthesize  an  optimal  application-specific  interconnect 
topology 
Scheduling


•  Task graph G(V, E), v ∈ V, e ∈ E

-  Dataflow specification

-  General point-to-point networks

•  Topology graph T(P, L), p ∈ P, l ∈ L

-  Link constraints

-  Processor fanout constraints

-  l = (p_i, p_j) assigned weights: delay and power

-  $\varepsilon(G, T) = \sum_{e \in E} \Bigl( \sum_{l \in \mathrm{route}(e)} E_{\mathrm{bit}}(l) \Bigr)$

•  Communication  hop  limit 
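
A communication hop limit of this kind can be checked against a candidate mapping with a short Python sketch. It assumes directed links stored as an adjacency dictionary and a mapping from tasks to processors. The names `hop_distances` and `violates_hop_limit` and the example data are hypothetical.

```python
from collections import deque


def hop_distances(links, source):
    """BFS hop count from `source` to every processor reachable over directed links."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        p = queue.popleft()
        for q in links.get(p, ()):
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return dist


def violates_hop_limit(task_edges, mapping, links, hop_limit):
    """Return the task-graph edges whose endpoints are mapped to processors
    that are unreachable or more than `hop_limit` hops apart."""
    bad = []
    for u, v in task_edges:
        src, dst = mapping[u], mapping[v]
        if src == dst:
            continue  # same processor, no interprocessor communication
        hops = hop_distances(links, src).get(dst)
        if hops is None or hops > hop_limit:
            bad.append((u, v))
    return bad


# Hypothetical example: a directed ring of three processors.
links = {0: [1], 1: [2], 2: [0]}
task_edges = [("A", "B"), ("B", "C")]
mapping = {"A": 0, "B": 1, "C": 0}

print(violates_hop_limit(task_edges, mapping, links, hop_limit=1))  # [('B', 'C')]
print(violates_hop_limit(task_edges, mapping, links, hop_limit=2))  # []
```

In the example, B on processor 1 can only reach C on processor 0 by relaying through processor 2, so the mapping is rejected under a single-hop limit and accepted under a two-hop limit.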


UNIVERSITY OF MARYLAND


Neal  K.  Bambha  HPEC  September  28,  2004 



Effect of Topology


[Figure: application graph, with Topology 1 / Schedule 1 and Topology 2 / Schedule 2]


Lower makespan and communication energy for topology 1
Low Hop Communication Saves Energy


[Figure: karp10, 7 processors, random link removal]


Compare communication energy across a range of randomly
generated topologies with single-hop and 3-hop limit.
Application-Specific Interconnect Topologies

•  Design constraints for optical interconnects

-  Topology: total links, maximum fanout

-  Performance: throughput, power

•  Joint schedule/topology optimization

-  GA generates population of solution candidates T(P, L)

-  Scheduler evaluates fitness of each T

*  DLS adapted for arbitrary topologies

*  Avoids deadlock, calculates flexibility

*  Constructs hop-limited schedules

-  Given constraints on T, maximize performance

-  Given constraints on performance, optimize T